Instructions

Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.

Cereal Data

Today, we start by looking at a collection of breakfast cereals:

With variables:

Produce a histogram of the sugar variable.
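
A minimal sketch of one way to draw such a histogram with ggplot2. The data frame name `cereal` and its values are assumptions made for illustration; the real data would be loaded earlier in the lab.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data; the values are made up.
cereal <- data.frame(sugar = c(0, 3, 5, 6, 6, 9, 10, 12, 13, 15))

# One bar per 3-gram interval of sugar per serving.
p <- ggplot(cereal, aes(x = sugar)) +
  geom_histogram(binwidth = 3)
p
```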

Now, compute the standard deviation of the variable sugar:

## [1] 4.378656
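
A value like the one above could come from a call to `sd()`. The sketch below uses a small made-up `cereal` data frame (an assumption), so its result will not match the output shown.

```r
# Hypothetical stand-in for the cereal data; the values are made up.
cereal <- data.frame(sugar = c(2, 4, 4, 4, 5, 5, 7, 9))

# sd() returns the sample standard deviation (n - 1 denominator),
# expressed in the same units as the variable itself.
sugar_sd <- sd(cereal$sugar)
sugar_sd
```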

What are the units of this measurement?

Answer: grams (of sugar per serving)

Now, compute the deciles of the variable score:

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 18.0 28.0 31.0 34.5 37.0 40.0 42.0 48.0 53.0 58.0 84.0
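
Deciles like these can be obtained with `quantile()` and a sequence of probabilities in steps of 0.1. The following is a sketch on made-up scores; the data frame `cereal` and its values are assumptions.

```r
# Hypothetical stand-in for the cereal data; the values are made up.
cereal <- data.frame(
  score = c(18, 25, 30, 33, 36, 40, 41, 47, 55, 60, 84)
)

# probs = 0, 0.1, ..., 1 gives the 0th through 100th percentiles.
deciles <- quantile(cereal$score, probs = seq(0, 1, by = 0.1))

# Individual deciles can be looked up by name.
p30 <- deciles["30%"]
```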

What is the value of the 30th percentile? Describe what this means in words:

Answer: The value of the 30th percentile is about 34.5. This means that 30% of the cereals have a score below about 34.5 and 70% of the cereals have a score above about 34.5.

Produce a boxplot of score and brand.
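
One possible way to draw a grouped boxplot with ggplot2 is sketched here. The data frame `cereal`, the brand labels, and the scores are all made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the cereal data; the values are made up.
cereal <- data.frame(
  brand = c("A", "A", "A", "B", "B", "B"),
  score = c(30, 35, 42, 50, 58, 61)
)

# One box per brand, showing the distribution of score within it.
p <- ggplot(cereal, aes(x = brand, y = score)) +
  geom_boxplot()
p
```

The same pattern would apply to the shelf plots, wrapping `shelf` in `factor()` if it happens to be stored as a number (an assumption about how that column is coded).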

Which brand seems to have the healthiest cereals?

Answer: Nabisco seems to generally have the healthiest cereals.

Produce a boxplot of score and shelf.

Produce a boxplot of sugar and shelf.

If I want a healthy but reasonably sweet cereal which shelf would be the best to look on?

Answer: The top shelf would be the best place to look. The top and bottom shelves generally have healthier options than the middle shelf (first boxplot), and the bottom shelf generally has less sugar than the top shelf (second boxplot). For a healthy but reasonably sweet cereal, the top shelf is therefore the best choice.

Tea Reviews

Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:

With variables:

- name: the full name of the tea
- type: the type of tea. One of:
    - black
    - chai
    - decaf
    - flavors
    - green
    - herbal
    - masters
    - matcha
    - oolong
    - pu_erh
    - rooibos
    - white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews

Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: geom_smooth(method="lm")).
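
A sketch of the scatterplot with a fitted line, using the `geom_smooth(method="lm")` layer recalled above. The data frame `tea` and its values are made up, so the slope here says nothing about the real data.

```r
library(ggplot2)

# Hypothetical stand-in for the tea data; the values are made up.
tea <- data.frame(
  num_reviews = c(40, 120, 300, 450, 700, 900, 1200),
  score       = c(90, 92, 91, 93, 94, 93, 95)
)

# Points plus a least-squares regression line of score on num_reviews.
p <- ggplot(tea, aes(x = num_reviews, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p
```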

Does the score tend to increase, decrease, or remain the same as the number of reviews increases?

Answer: Scores tend to increase as the number of reviews increases.

Calculate the ventiles of the variable price.

##     0%     5%    10%    15%    20%    25%    30%    35%    40%    45% 
##   8.00  10.00  10.00  10.00  10.00  10.00  12.00  12.00  12.00  12.00 
##    50%    55%    60%    65%    70%    75%    80%    85%    90%    95% 
##  13.00  15.00  15.00  17.00  19.00  20.00  30.00  35.35  49.30  86.75 
##   100% 
## 196.00
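
Ventiles split the data into 20 equal parts, so the probabilities run in steps of 0.05. The sketch below also checks the interpretation of a percentile by counting the share of values at or below it. The data frame `tea` and its prices are made up.

```r
# Hypothetical stand-in for the tea data; the prices are made up.
tea <- data.frame(
  price = c(8, 9, 10, 10, 11, 11, 12, 12, 12, 13, 13,
            15, 15, 17, 19, 20, 30, 35, 50, 90, 200)
)

# probs = 0, 0.05, ..., 1 gives the 21 ventile cut points.
ventiles <- quantile(tea$price, probs = seq(0, 1, by = 0.05))

# The 80th percentile, and the share of teas priced at or below it.
q80 <- ventiles["80%"]
share_at_or_below <- mean(tea$price <= q80)
```

With tied or discrete prices the share at or below a percentile is only approximately the nominal proportion.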

What is the 80th percentile? Describe it in words, including the units of the problem in your answer.

Answer: The 80th percentile is 30 dollars. This means that 80% of the teas have an estimated price of less than 30 dollars, while 20% of the teas have an estimated price of greater than 30 dollars.

Plot the number of reviews (x-axis) against the score variable. Color the points according to price binned into 5 buckets.
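
One way to bin price into 5 equally sized buckets is to cut it at its quintiles, as sketched here with base R's `cut()`. The lab may have used a different binning helper; the data frame `tea`, the column `price_bucket`, and all values are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the tea data; the values are made up.
tea <- data.frame(
  num_reviews = c(1200, 950, 800, 700, 640, 500, 320, 150, 90, 40),
  score       = c(94, 93, 95, 92, 93, 91, 94, 92, 95, 90),
  price       = c(8, 10, 11, 12, 13, 15, 19, 30, 49, 196)
)

# Quintile cut points: 0%, 20%, ..., 100% of price.
breaks <- quantile(tea$price, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE keeps the minimum price inside the first bucket.
tea$price_bucket <- cut(tea$price, breaks = breaks, include.lowest = TRUE)

p <- ggplot(tea, aes(x = num_reviews, y = score, color = price_bucket)) +
  geom_point()
p
```

Note that `cut()` fails when two quintiles coincide, which can happen when many teas share the same price; the toy prices here are all distinct.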

What tends to be true about the number of reviews for the most expensive 20% of teas?

Answer: They tend not to have many reviews.

Create a dataset named white that consists of only white teas.
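
A sketch of the subsetting step using dplyr's `filter()`. The data frame `tea` and its rows are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the tea data; the values are made up.
tea <- data.frame(
  name  = c("Tea A", "Tea B", "Tea C", "Tea D"),
  type  = c("white", "green", "white", "black"),
  price = c(12, 10, 30, 15)
)

# Keep only the rows whose type is exactly "white".
white <- filter(tea, type == "white")
white
```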

Calculate the standard deviation of the price for white teas and the standard deviation of the price for all of the teas.

## [1] 13.59444
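
The two standard deviations asked for can be computed side by side, as in this sketch. The data frame `tea` and its prices are made up, so the comparison it produces does not reflect the real dataset.

```r
# Hypothetical stand-in for the tea data; the values are made up.
tea <- data.frame(
  type  = c("white", "white", "white", "green", "green", "black"),
  price = c(10, 20, 30, 8, 12, 100)
)

# Spread of price among the white teas only.
sd_white <- sd(tea$price[tea$type == "white"])

# Spread of price across every tea in the dataset.
sd_all <- sd(tea$price)

c(white = sd_white, all = sd_all)
```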

Is the variation of the white tea prices smaller, larger, or about the same as the entire dataset?

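Answer: Taking 13.59444 as the standard deviation of the white tea prices, the right comparison is with the standard deviation of price across all teas, not with the 4.378656 computed earlier, which is the standard deviation of sugar in the cereal data. The price ventiles above run from 8 to 196, with 10% of teas priced above 49.30, which suggests that prices in the entire dataset vary more than the white tea prices do.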

Summarize the dataset by the type of tea and save the results as a variable named tea_type.

## # A tibble: 12 x 14
##       type score_mean price_mean num_reviews_mean score_median
##      <chr>      <dbl>      <dbl>            <dbl>        <dbl>
##  1   black   93.66667   23.37037         995.1852         94.0
##  2    chai   93.33333   12.00000        1069.4444         93.0
##  3   decaf   93.15385   15.00000         302.6154         94.0
##  4 flavors   92.00000   10.00000         890.8421         92.0
##  5   green   93.00000   17.94737         668.2632         93.0
##  6  herbal   93.19231   11.61538         916.1923         93.0
##  7 masters   94.60000  123.66667         114.5333         95.0
##  8  matcha   91.00000   60.00000         107.6667         92.0
##  9  oolong   93.50000   28.93750         635.8125         94.0
## 10  pu_erh   91.57143   20.57143         473.4286         92.0
## 11 rooibos   92.30769   11.73077         508.6538         92.5
## 12   white   92.70588   26.94118         632.8235         93.0
## # ... with 9 more variables: price_median <dbl>, num_reviews_median <dbl>,
## #   score_sd <dbl>, price_sd <dbl>, num_reviews_sd <dbl>, score_sum <int>,
## #   price_sum <int>, num_reviews_sum <int>, n <int>
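
A table with this shape (one row per type, with mean, median, standard deviation, and sum for each numeric variable, plus a count) could be built with dplyr as sketched below. The lab may have used a course-specific helper instead; the toy `tea` data frame is an assumption, and the column order differs from the output above.

```r
library(dplyr)

# Hypothetical stand-in for the tea data; the values are made up.
tea <- data.frame(
  type        = c("green", "green", "white", "white", "white"),
  score       = c(92, 94, 91, 93, 95),
  price       = c(10, 20, 15, 25, 35),
  num_reviews = c(100, 300, 50, 150, 250)
)

# across() with a named list produces columns named <variable>_<statistic>,
# for example score_mean and price_sd; n() counts the teas in each type.
tea_type <- tea %>%
  group_by(type) %>%
  summarize(
    across(
      c(score, price, num_reviews),
      list(mean = mean, median = median, sd = sd, sum = sum)
    ),
    n = n()
  )
tea_type
```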

Plot the average price (x-axis) against the average score (y-axis) of each type of tea. Make the size of the points proportional to the number of teas in each category and label the points with geom_text_repel and the tea type.
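
A sketch of this plot using ggplot2 together with `geom_text_repel` from the ggrepel package. The summary table `tea_type` is rebuilt here from made-up numbers, and the column names `price_mean`, `score_mean`, and `n` are assumed to match the summary above.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in for the per-type summary; the values are made up.
tea_type <- data.frame(
  type       = c("type_a", "type_b", "type_c"),
  price_mean = c(12, 25, 60),
  score_mean = c(93, 94, 91),
  n          = c(20, 10, 3)
)

# Point size tracks the number of teas; labels are nudged to avoid overlap.
p <- ggplot(tea_type, aes(x = price_mean, y = score_mean)) +
  geom_point(aes(size = n)) +
  geom_text_repel(aes(label = type))
p
```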

Describe an interesting pattern or set of outliers that you found in the previous plot. This does not need to take more than 1-2 sentences.

Answer: One interesting outlier is matcha, a category with relatively few teas. Its average price (60) is the second highest, behind only masters (about 124), yet it has the lowest average score (91).